Exploring term-document matrices from matrix models in text mining

Ioannis Antonellis * Efstratios Gallopoulos * 



Abstract 

We explore a matrix-space model that is a natural extension of the
vector space model for Information Retrieval. Each document can
be represented by a matrix that is based on document extracts (e.g. 
sentences, paragraphs, sections). We focus on the performance of 
this model for the specific case in which documents are originally 
represented as term-by-sentence matrices. We use the singular 
value decomposition to approximate the term-by-sentence matrices 
and assemble these results to form the pseudo-"term-document" 
matrix that forms the basis of a text mining method alternative 
to traditional VSM and LSI. We investigate the singular values of 
this matrix and provide experimental evidence suggesting that the 
method can be particularly effective in terms of accuracy for text 
collections with multi-topic documents, such as web pages with 
news. 

1 Introduction 

The vector space model (VSM), introduced by Salton [20],
is one of the oldest and most extensively studied models for 
text mining. This is so because it permits using theories and 
tools from the area of linear algebra along with a number of 
heuristics. A collection of n documents is represented by a 
term-by-document matrix (tdm) of n columns and m rows, 
where m is the number of terms used to index the collection. 
Each element of the matrix is a suitable measure of the 
importance of term i with respect to the document and the 
entire collection. Although numerous alternative weighting 
schemes have been proposed and extensively studied, there 
are some well-documented weaknesses that have motivated 
the development of new methods building on VSM. The
best known is Latent Semantic Indexing (LSI) [10], where
the column space of the tdm is approximated by a space of
much smaller dimension that is obtained from the leading
singular vectors of the matrix. The model is frequently found
to be very effective even though the analysis of its success
is not as straightforward [18]. The computational kernel
in LSI is the singular value decomposition (SVD) applied
to the tdm. This provides the mechanism for projecting
data onto a lower, k-dimensional space spanned by the k
leading left singular vectors; cf. the exposition in fl1l5l ll2l .
In addition to performing dimensionality reduction, LSI 
captures hidden semantic structure in the data and resolves 
problems caused in VSM by synonymy and polysemy. A 
well-known difficulty with LSI is the cost of the SVD for the 
large, sparse tdm's appearing in practice. This complicates 
not only the original approximation but also the updating 
of the tdm whenever new documents are to be added or 
removed from the original document collection. These are 
obstacles to the application of LSI on very large tdm's, so 
several efforts in the area are directed towards alleviating 
this cost. These range from techniques for lowering the
cost of the (partial) SVD (e.g. exploiting sparse matrix
technology and fast iterative methods, cf. [3], [7], [23]) to the
application of randomized techniques ([l] II li t) specifically
targeting very large tdm's. One approach that appears to
be promising is to approximate the tdm by operating on
groups of documents that either arise naturally (e.g. because
the documents reside at distant locations) or as a result of
clustering GD [13], [25]. It was shown in [25], for example,
that clustering and then using a few top left singular vectors
of the tdm corresponding to each cluster can lead to an
economical and effective approximation of the tdm.
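The LSI projection described above can be illustrated with a minimal sketch. It assumes NumPy and a dense SVD on a toy matrix; the function names `lsi_project` and `lsi_query_scores` and the data are hypothetical and not taken from the paper, and a practical implementation would use a sparse partial SVD instead.

```python
import numpy as np

def lsi_project(A, k):
    """Return the k leading left singular vectors of the tdm A and the
    k-dimensional coordinates of its columns (documents)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    U_k = U[:, :k]
    docs_k = U_k.T @ A            # k x n coordinates of the documents
    return U_k, docs_k

def lsi_query_scores(A, q, k):
    """Cosine scores between a query and the documents in the k-dim space."""
    U_k, docs_k = lsi_project(A, k)
    q_k = U_k.T @ q
    num = docs_k.T @ q_k
    den = np.linalg.norm(docs_k, axis=0) * np.linalg.norm(q_k)
    return num / np.where(den == 0, 1.0, den)

# Toy tdm: 4 terms (rows) by 3 documents (columns).
A = np.array([[2., 0., 1.],
              [1., 0., 1.],
              [0., 3., 0.],
              [0., 1., 0.]])
q = np.array([1., 1., 0., 0.])    # query using the first two terms
scores = lsi_query_scores(A, q, k=2)
```

In this toy case documents 1 and 3 share the query terms and document 2 does not, so the second score is essentially zero.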

In this "work in progress", we explore a family of text
mining models arising as a natural extension of VSM and
present cases where they appear to be able to capture more
information about text documents and their structure. Our
starting point is that the tdm utilized in VSM and LSI has
no "memory" of how it was constructed; in particular, any
of the tdm columns can be decomposed in an unlimited
number of ways as a linear combination of other vectors. We
can, however, express each document vector as the sum of
vectors resulting from the document terms appearing at a
selected level of the document's hierarchical structure (e.g.
sections, paragraphs, sentences, etc.). We can thus consider
each of the document vectors to be the product of a
"term-by-extract unit" matrix with a vector of all 1's. Based
on this, it is a natural next step to consider approximating
each term-document vector. We loosely refer
to the general idea as the Matrix Space Model (MSM). MSM
permits us to capture suitable decompositions of document
vectors based on the document's hierarchical structure (into
sections, paragraphs, sentences, etc.) and store them in
a matrix. To explore the model's properties, we study a
specific instance based upon document decomposition into
sentences. Sentence-based decompositions have already
been applied in text classification and summarization |2
|H1 El 1241; therefore the analysis we provide is also of
independent interest. We also note an elegant recent proposal
for a matrix-based IR framework close, but not the same,
as ours in [19], as well as another phrase-based framework
[14] for clustering of semi-structured Web documents. We
discuss these approaches later in this paper. As will become
apparent in the sequel, one common useful feature of
MSM-type models is that they can readily lead to the tdm of the
original VSM.

Based on the representation of the tdm as a matrix
whose columns are obtained by multiplying a
"term-by-extract unit" matrix with a vector of all 1's, we approximate
each column based on this decomposition. It is worth
noting that our proposed approach has an analogue in the
numerical solution of partial differential equations, namely
domain decomposition techniques based on substructuring
[6]. These are powerful tools that also lend themselves to
parallel processing.

The rest of the paper is structured as follows. In Section 2
we describe the matrix space models and show their
relation with classic VSM and its variants. We also provide
a formal study of the text analysis using document
decomposition into its sentences and introduce formal definitions
for the term-by-sentence and other matrices useful for MSM.
Based on these, we describe a general IR method based on
this approach and specify its use for the case of
sentence-based analysis. In Section 3 we analyze the method and its
relevant costs, and derive spectral information for the matrix
underlying the IR strategy. Section 4 presents our
experimental analysis. Finally, in Section 5 we give our conclusions
and future directions.

Throughout the paper, we use pseudo-MATLAB notation.
We refer to the j-th column of any matrix
$A$ as $a_j$, so that $a_j = A e_j$, where $e_j$ denotes the j-th column
of an (appropriately sized) identity matrix $I$. We will also
represent $A$ as

$$A = [a_1, \ldots, a_n]$$

and use $\mathrm{best}_k(A)$ to denote its best rank-$k$ approximation.
We will use $e$ to refer to the vector of all 1's, whose
size is assumed to be appropriate for the computation to
be valid. Given scalars (square submatrices) $\phi_j$, we use
$\mathrm{diag}[\phi_1, \ldots, \phi_l]$ to denote the corresponding diagonal (block
diagonal) matrix. When the need arises (e.g. two "e"-vectors
of different dimension in the same formula) a superscript will
be used to show the difference, e.g. $e^{(k)}$.



2 Matrix space models for IR 

In the VSM, each document is represented using an m- 
dimensional vector. Each of the m dimensions refers to an 
indexing term and each coordinate of the vector is computed 
using some combination of a local and/or global weighting 
scheme. Weighting schemes can be seen as heuristics that 
help eliminate problems arising from the non-orthogonality 
of the different indexing terms and have been proven to be 
efficient for improving "precision" and "recall". 

We next observe that for a vector representation of a 
document, there are unlimited decompositions into a (given) 
number of components. Using the vector space model, such 
components could be seen as different "concepts", which 
combined, generate the concept of the given document. In 
fact, some reasonable decompositions would create the
components using consecutive extracts of the document. As VSM
stores only the final vector of each document, it
does not exploit this kind of extra information.
The goal of MSMs is to utilize meaningful document
decompositions that are based upon its structure; cf. [14], [19].
As document structure often builds a hierarchy into sections,
paragraphs and sentences, we can decompose each document
into the vector space representation of non-overlapping and
sequential extracts that correspond to them and store such
decompositions into a term-by-extract matrix (tem) (also called
"term-location" matrix in [19]). The choice of the level of the
hierarchy that the decomposition relies on can result in
document representation using term-by-section (tsm),
term-by-paragraph (tpm) or term-by-sentence (tsm) matrices. The
common thread is that MSMs use matrices to store vector
space representations of a document's extracts. The j-th
column of such a matrix refers to the vector space representation
(based only on term frequency) of the j-th extract of the
document. These are features that our paper shares with [14], [19].
On the other hand, [14] addresses primarily the issue of
effective indexing - via subgraphs and document index graphs
- for sentence-based analyses, while [19] is concerned with
the formal framework surrounding term-location and
term-document matrices. None of these papers, however,
considers the idea proposed herein, namely the replacement of the
original document vectors with approximants and the effect
of such replacements on retrieval performance.

As Figure 1 illustrates, MSM can be used as a
transitional phase before producing the document vector. Given a
tem $H \in \mathbb{R}^{m \times n}$ of a document $D$, we can construct its vector
space representation that is based on an arbitrary
combination of local and/or global weighting schemes. As elements
$h_{ij}$ of $H$ are based only on term frequency, the vector space
representation of any document vector $a \in \mathbb{R}^m$ in the tdm can
be written as $a = GHp$, where $G \in \mathbb{R}^{m \times m}$ and $p \in \mathbb{R}^n$.
Matrix $G$ is a diagonal matrix with nonzero diagonal elements
accounting for the global weighting scheme. Column
vector $p$ corresponds to the local weighting scheme applied on
document $D$. For example, $G = I$ corresponds to the
application of no global weighting scheme, while $p = e$
corresponds to "term frequency" local weighting. Hereafter, we
will assume that transition from matrix space to vector space
is done by applying $G = I$ and $p = e$ to the tem. However, our
results can be generalized to include more complex
weighting schemes.

Figure 1: Transition between matrix space and vector space.

In the following section, we study MSM based upon
document representation using tsm's. For simplicity, we
define a "sentence" to be text delimited between two
consecutive periods ("."). We do not address here the interesting
issues involved in sentence identification (e.g. see [8], [17]).
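As an illustration of the transition $a = GHp$ and of the simplified sentence definition, the following sketch builds a term-by-sentence matrix from raw text in plain Python. The helper names (`split_sentences`, `term_by_sentence`, `document_vector`), the whitespace tokenization and the lowercasing are assumptions made for the example only; they are not part of the paper or of TMG.

```python
def split_sentences(text):
    """Split on periods, following the simplified sentence definition."""
    return [s.strip() for s in text.split(".") if s.strip()]

def term_by_sentence(text):
    """Return (vocab, H) where H[i][j] is the frequency of term i in sentence j."""
    sentences = split_sentences(text)
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    index = {w: i for i, w in enumerate(vocab)}
    H = [[0] * len(sentences) for _ in vocab]
    for j, s in enumerate(sentences):
        for w in s.lower().split():
            H[index[w]][j] += 1
    return vocab, H

def document_vector(H, g=None, p=None):
    """Compute a = G H p with G = diag(g); defaults are G = I and p = e."""
    m = len(H)
    n = len(H[0]) if H else 0
    g = [1.0] * m if g is None else g
    p = [1.0] * n if p is None else p
    return [g[i] * sum(H[i][j] * p[j] for j in range(n)) for i in range(m)]

text = "The cat sat. The dog sat. A cat ran."
vocab, H = term_by_sentence(text)
a = document_vector(H)   # term frequencies of the whole document
```

With the defaults $G = I$ and $p = e$, the result is simply the row sums of the tsm, that is, the usual term-frequency vector of the document.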



2.1 Text analysis based on sentences. Let $A$ denote a
tdm of rank $r$ and let its SVD be

$$A = U \Sigma V^T = \sum_{i=1}^{r} \sigma_i u_i v_i^T \qquad (2.1)$$

where the rightmost expression is the dyadic decomposition
and, as usual, the singular values are arranged in
non-increasing order. We also write $\mathrm{best}_k(A)$ for the best rank-$k$
approximation of $A$ (we assume here that $\sigma_k > \sigma_{k+1}$):

$$\mathrm{best}_k(A) = \sum_{i=1}^{k} \sigma_i u_i v_i^T \qquad (2.2)$$

Note also that

$$\Big(\sum_{i=1}^{r} \sigma_i u_i v_i^T\Big) e_j = \sum_{i=1}^{r} \sigma_i u_i v_{ij} \qquad (2.3)$$

and similarly for the j-th column of $\mathrm{best}_k(A)$:

$$\mathrm{best}_k(A)\, e_j = \sum_{i=1}^{k} \sigma_i u_i v_{ij} \qquad (2.4)$$

Having assumed that matrix $A$ is a tdm, the j-th column of $A$
will correspond to the vector space representation of the j-th
document of the collection. So, we can write

$$a_j = A e_j = \sum_{k=1}^{m} s_k \qquad (2.5)$$

where $s_k$, $k = 1, \ldots, m$, is the vector space representation of
the k-th sentence of document j of the collection and $m$ the
total number of sentences of the j-th document. We can now
construct the tsm of document j of our collection according
to the following definition:

DEFINITION 2.1. Let document $D$ contain $m$ sentences and
$d$ be its vector space representation. The term-by-sentence
matrix of document $D$ is the matrix

$$S_D = [s_1, s_2, \ldots, s_m] \qquad (2.6)$$

where $s_k$ refers to the vector space representation of the k-th
sentence of $D$.

Using the above notation, Equation (2.5) can be written as

$$a_j = A e_j = S_D e \qquad (2.7)$$

We also introduce the notion of the "term-by-sentence matrix
for a collection". For example, if we have two
documents $D_1, D_2$, their tsm's are $S_{D_1}$ and $S_{D_2}$, and the usual
tdm from the VSM is the matrix of two columns $A = [a_1, a_2]$,
then the tsm for the collection is the matrix $S_C = [\hat{S}_{D_1}, \hat{S}_{D_2}]$,
where $\hat{S}_{D_j}$ is an embedding of the original $S_{D_j}$ into a matrix
with as many rows (terms) as $A$. In other words, we augment
each one of the tsm's $S_{D_j}$ with zero rows corresponding to
those terms in the collection's vocabulary but not present in
$D_j$. In general, we have the following:

DEFINITION 2.2. Let $C$ be a collection of $k$ documents
$D_1, D_2, \ldots, D_k$, where the i-th document consists of $m_i$
sentences. The term-by-sentence matrix of the collection $C$
is the matrix $S_C$:

$$S_C = [\hat{S}_{D_1}, \hat{S}_{D_2}, \ldots, \hat{S}_{D_k}] \qquad (2.8)$$

where $\hat{S}_{D_j}$ is an embedding of the original $S_{D_j}$ into a zero
matrix with as many rows as the tdm of the VSM
representation for $C$.
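The embedding step of Definition 2.2 can be sketched in a few lines of plain Python. The helper names `embed` and `hstack` and the two toy documents are hypothetical; the sketch only illustrates padding each tsm with zero rows and concatenating the results, and then checks that summing the sentence columns of each block reproduces the usual tdm column.

```python
def embed(tsm, doc_vocab, coll_vocab):
    """Embed a document's tsm into the collection vocabulary by adding
    zero rows for terms that do not occur in the document."""
    row_of = {w: i for i, w in enumerate(doc_vocab)}
    n = len(tsm[0])
    return [list(tsm[row_of[w]]) if w in row_of else [0] * n
            for w in coll_vocab]

def hstack(blocks):
    """Concatenate matrices (lists of rows) with equal row counts side by side."""
    return [sum((b[i] for b in blocks), []) for i in range(len(blocks[0]))]

# Two toy documents with their own vocabularies and tsm's.
vocab1, S1 = ["cat", "sat"], [[1, 1], [1, 0]]   # 2 sentences
vocab2, S2 = ["dog", "sat"], [[1], [2]]         # 1 sentence

coll_vocab = sorted(set(vocab1) | set(vocab2))
E1 = embed(S1, vocab1, coll_vocab)
E2 = embed(S2, vocab2, coll_vocab)
S_C = hstack([E1, E2])                          # tsm of the collection

# Summing the sentence columns of each block gives the VSM tdm columns.
tdm = [[sum(r1), sum(r2)] for r1, r2 in zip(E1, E2)]
```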

The MSM provides a more general framework for IR
[14], [19]. Our objective here is to investigate the
performance of such a scheme and evaluate it relative to LSI
and VSM. As we show, in specific cases the method can
achieve results similar to LSI with respect to accuracy
measures such as precision and recall, while keeping the
computational costs at the level of simple VSM.

The rationale of our method is that projecting
sentence vectors of tsm's onto the subspace spanned by the
singular vectors corresponding to the $k$ largest singular values,
for some small value of $k$, permits us to eliminate polysemy
and synonymy phenomena within the document. This is
accomplished by the rank reduction of tsm's that the SVD
produces. It is clear that, as these phenomena are eliminated
locally within every document, it will be difficult to improve on
LSI.

Algorithm: Construct pseudo-tdm based on tsm
Input: Document collection $\{D_1, \ldots, D_m\}$
Output: Pseudo-tdm $\hat{A}$

I. For each document $D_j$:
   1. Prepare tsm $S_{D_j}$
   2. Select $k' \le \mathrm{rank}(S_{D_j})$
   3. $\hat{a}_j = \mathrm{best}_{k'}(S_{D_j})\, e$
II. Assemble pseudo-tdm $\hat{A} := [\hat{a}_1, \ldots, \hat{a}_m]$

Table 1: Construction of pseudo-tdm from document collection.

However, when the method is applied to collections with
documents whose context is not semantically specific but
multi-topic (e.g. documents from web pages with news)
or to collections with a large percentage of different terms
per document, we provide experimental evidence that its
performance surpasses classic vector space and comes close
to LSI's performance. The objective is, when a document
that refers to $k$ semantically well-separated topics is given as
input, for the projection to identify the principal directions
of these topics. Then, by using the projected sentence
vectors (instead of the original ones) we can transition to the
vector space (from the MSM that uses tsm) by constructing
approximations to the vectors of the VSM tdm according
to some weighting scheme. We refer
to this tdm with approximated columns as "pseudo-tdm".
In particular, the j-th column of matrix $\hat{A}$ is not computed
according to Equation (2.7) but as follows: Let $D_j$ be the j-th
document, $S$ the corresponding tsm, $r$ its rank and $t$ the
number of columns (sentences) in $D_j$. Then

$$S = \sum_{i=1}^{r} \sigma_i u_i v_i^T \quad \text{and} \quad \mathrm{best}_{k'}(S) = \sum_{i=1}^{k'} \sigma_i u_i v_i^T, \qquad (2.9)$$

where $k' \le r$, $(\sigma_i, u_i, v_i)$ are singular triplets of $S$ with the $\sigma_i$'s
arranged in decreasing order, and $\sigma_{k'} > \sigma_{k'+1}$. Then column
$\hat{a}_j$ of the pseudo-tdm $\hat{A}$ can be constructed as

$$\hat{a}_j = \mathrm{best}_{k'}(S)\, e^{(t)}. \qquad (2.10)$$

The steps for the sentence-oriented algorithm we are
experimenting with are shown in Table 1. Query vectors
$q$ are therefore compared to the columns of the
pseudo-tdm. Cosine similarity can be computed using the following
formula:

$$\cos\theta_j = \frac{\hat{a}_j^T q}{\|\hat{a}_j\|_2 \, \|q\|_2} \qquad (2.11)$$
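The cosine comparison between a query and the columns of the pseudo-tdm can be sketched as follows. This is a plain-Python illustration; the function names `cosine` and `rank_documents`, the convention of returning 0 for a zero vector, and the toy data are assumptions made for the example.

```python
import math

def cosine(a, q):
    """Cosine of the angle between a document vector a and a query q."""
    dot = sum(x * y for x, y in zip(a, q))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_q = math.sqrt(sum(y * y for y in q))
    if norm_a == 0 or norm_q == 0:
        return 0.0               # convention chosen for this sketch
    return dot / (norm_a * norm_q)

def rank_documents(columns, q):
    """Return document indices sorted by decreasing cosine score, plus scores."""
    scores = [cosine(a, q) for a in columns]
    order = sorted(range(len(columns)), key=lambda j: -scores[j])
    return order, scores

# Columns of a toy (pseudo-)tdm, one list per document.
columns = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]
q = [1.0, 0.0]
order, scores = rank_documents(columns, q)
```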



Figure 2 depicts schematically how MSM can be used
to develop a new IR method. Transitional representation 
of a document in the matrix space permits the application 
of matrix transformations (in our case, approximation via 
SVD) before producing the vector that will represent the 
document in the tdm. Each approximated document vector 
is "assembled" from structures that are local to the document 
(substructures). 

[Diagram labels: Document; Matrix model; Document's matrix; SVD;
Best(k) approximation of document's matrix; Local/global weighting
scheme; Approximated document vector.]

Figure 2: Using MSM as a tool for developing a new IR method.

It is worth noting that in the algorithm presented in Table 1,
each $k'$ is selected so that $k' \le r$, where $r$ is the rank of the
particular tsm. If we choose $k' = r$, the resulting
vector becomes identical to the one obtained in the tdm of
the classical VSM (before any weighting); furthermore, there
is no need to perform an SVD of the tsm. Therefore, if
this choice is made for every document, the whole
pseudo-tdm reverts to the usual tdm, highlighting that MSM is a
generalization of VSM.
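The construction of Table 1, together with the observation that $k' = r$ recovers the VSM tdm, can be checked numerically with a small sketch. It assumes NumPy, a dense SVD, and tsm's that have already been embedded into a common vocabulary; the names `best_rank` and `pseudo_tdm` and the toy matrices are hypothetical.

```python
import numpy as np

def best_rank(S, k):
    """Best rank-k approximation of S from its (dense) SVD."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

def pseudo_tdm(tsms, ks):
    """Column j is best_{k'}(S_j) e, with e the all-ones vector."""
    cols = []
    for S, k in zip(tsms, ks):
        e = np.ones(S.shape[1])
        cols.append(best_rank(S, k) @ e)
    return np.column_stack(cols)

# Two toy tsm's over the same 4-term vocabulary (already embedded).
S1 = np.array([[1., 0., 1.], [1., 1., 0.], [0., 1., 0.], [0., 0., 2.]])
S2 = np.array([[0., 1.], [2., 0.], [0., 0.], [1., 1.]])
tsms = [S1, S2]

A_vsm = np.column_stack([S.sum(axis=1) for S in tsms])   # usual tdm (tf only)
ranks = [np.linalg.matrix_rank(S) for S in tsms]
A_full = pseudo_tdm(tsms, ranks)      # k' = r for every document
A_one = pseudo_tdm(tsms, [1, 1])      # k' = 1 for every document
```

Since $\mathrm{best}_{k'}(S)e$ is an orthogonal projection of $Se$ onto the span of the leading left singular vectors, each pseudo-tdm column has norm no larger than the corresponding VSM column.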

2.2 Computational costs. As with LSI, the method also 
relies on the (partial) SVD. The difference with LSI is that 
there are multiple SVD's, one per tsm for each document. 
Note that the tsm's can differ widely in size. Furthermore, 
the number of rows of each tsm will typically be much 
smaller than the full tdm since the number of terms in each 
sentence is expected to be much smaller than the total num- 
ber of terms in the collection. More importantly, even though 
the approximate term-document vector resulting from this 
process for each document might be far less sparse than 
the term-document vector corresponding to the original tdm, 
many zeros will be introduced at the embedding phase, to 
take into account terms that are present in other documents 
but not this one. It is also worth noting, though we leave
it for future study, that the SVD's are independent for each
document and hence can be processed in a distributed
manner. It can happen, of course, that the number of sentences
in a document is large, even larger than the number
of documents in the entire collection. To address this, we
can make use of the flexibility of the proposed methodology
and adjust our analysis to any level of the document's
hierarchy that is convenient (sentences, paragraphs, sections, ...).
This is work in progress and we plan to report on it in the
future. Table 2 illustrates indicative sizes for the tsm's
resulting from the MEDLINE datasets. These tsm's appear to
be small enough compared to the tdm, and therefore the
application of our method to them results in computational costs
very close to those of VSM. We finally note that another
advantage of this approach is that, as the approximation is
performed locally for each document, the method does not entail
significant costs when performing document updates. In
particular, the update of the pseudo-tdm using a new document
will only cause a non-trivial change to existing document
vectors if terms that already existed in these documents
were not accounted for until then, e.g. because of low global
frequency. In that case, we would need to update, in some
way, the SVD of the tsm of each affected document.



MED #   terms per doc.   sentences per doc.   total terms   total docs.
1       45 (0.81%)       6 (0.58%)            5526          1033
2       90 (1.62%)       13 (0.62%)           5526          2066
3       135 (2.44%)      19 (0.61%)           5526          3099
4       180 (3.25%)      24 (0.58%)           5526          4132
5       225 (4.07%)      28 (0.54%)           5526          5165
6       270 (4.88%)      34 (0.55%)           5526          6198
7       315 (5.70%)      40 (0.55%)           5526          7231
8       360 (6.51%)      47 (0.57%)           5526          8264
9       405 (7.32%)      52 (0.56%)           5526          9297
10      450 (8.14%)      58 (0.56%)           5526          10330

Table 2: Example sizes for tsm's (average number of terms
and sentences per document for the datasets we used).



3 Analysis 

To shed further light on the nature of the matrix resulting
from tsm's, in this section we study the behavior of the
singular values of a collection's pseudo-tdm that has been
constructed using this approach. We show that
pseudo-tdm's preserve the so-called low-rank-plus-shift property of
tdm's [22]. Note that this property can be put to practical
use to enhance the effectiveness of update algorithms [27].
Even though we do not study such update schemes here, it is
useful to know that they could be applied on pseudo-tdm's.

DEFINITION 3.1. (Low rank plus shift structure)
A matrix $A \in \mathbb{R}^{m \times n}$ is said to have "low rank plus shift"
structure if it satisfies:

$$\frac{A^T A}{m} \approx C W C^T + \sigma^2 I \qquad (3.12)$$

where $C \in \mathbb{R}^{n \times k}$ and matrix $W \in \mathbb{R}^{k \times k}$ is a symmetric positive
definite matrix.



When $A$ represents a collection's tdm, $C \in \mathbb{R}^{n \times k}$ is the
matrix whose columns represent latent concepts of the
collection. The use of the terminology "low rank plus shift" comes
from the fact that in IR applications, $k \ll \min\{m,n\}$. The
singular values of such matrices have the following
distribution: the first few singular values are large but decrease
rapidly and then the curve becomes flat but not necessarily
zero. In order to determine if a matrix satisfies this property,
we follow the analysis of Zha and Zhang [27] and
investigate the following matrix approximation problem: given
a rectangular matrix, what is the closest matrix that has the
low rank plus shift property? We can then define a matrix set
for a given $k > 0$,

$$J_k = \{ B \in \mathbb{R}^{m \times n} \mid \sigma_1(B) \ge \cdots \ge \sigma_{\min\{m,n\}}(B),\ \sigma_{k+1}(B) = \cdots = \sigma_{\min\{m,n\}}(B) \} \qquad (3.13)$$

Using this notation, the matrix approximation problem is
reduced to finding the distance between a general matrix $A$
and the set $J_k$. The next theorem provides such a solution.

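THEOREM 3.1. (Zha and Zhang [27]) Let the SVD of
$A$ be $A = U \Sigma V^T$, $\Sigma = \mathrm{diag}[\sigma_1, \ldots, \sigma_{\min\{m,n\}}]$ and $U$, $V$
orthogonal. Then for $k < \min\{m,n\}$ we have:

$$\arg\min_{J \in J_k} \|A - J\|_p = U_k \Sigma_k V_k^T + \tau_p\, U_k^{\perp} (V_k^{\perp})^T,$$

where $\Sigma_k = \mathrm{diag}(\sigma_1, \ldots, \sigma_k)$, $U = [U_k, U_k^{\perp}]$, $V = [V_k, V_k^{\perp}]$ and
$p$ refers to either the Frobenius ($p = F$) or the spectral
($p = 2$) norm. Furthermore,

$$\tau_F = \frac{\sum_{i=k+1}^{\min\{m,n\}} \sigma_i}{\min\{m,n\} - k}$$

and

$$\tau_2 = \frac{\sigma_{k+1} + \sigma_{\min\{m,n\}}}{2}.$$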
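A minimal numerical sketch of the closest low-rank-plus-shift matrix is given below. It assumes NumPy and works with the thin SVD, so that the trailing blocks $U_k^{\perp}$ and $V_k^{\perp}$ are restricted to the remaining $\min\{m,n\} - k$ singular vectors; the function name `nearest_low_rank_plus_shift` and the relative-distance quantity computed at the end are assumptions made for illustration, not the exact quantities reported in the paper.

```python
import numpy as np

def nearest_low_rank_plus_shift(A, k, norm="fro"):
    """Keep the k leading singular values of A and replace the remaining
    ones by a single shift tau (their mean for the Frobenius norm, the
    midpoint of the largest and smallest of them for the spectral norm)."""
    q = min(A.shape)
    if not 0 < k < q:
        raise ValueError("k must satisfy 0 < k < min(m, n)")
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    if norm == "fro":
        tau = s[k:].sum() / (q - k)
    else:
        tau = 0.5 * (s[k] + s[q - 1])
    s_new = np.concatenate([s[:k], np.full(q - k, tau)])
    J = (U * s_new) @ Vt          # scale the columns of U, then recombine
    return J, tau

rng = np.random.default_rng(0)
A = rng.random((6, 4))            # toy dense matrix standing in for a tdm
J, tau = nearest_low_rank_plus_shift(A, k=2)
relative_distance = np.linalg.norm(A - J) / np.linalg.norm(A)
```

A smaller `relative_distance` indicates that the trailing singular values of `A` are closer to being flat.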



Using the above theorem, we can examine experimentally
how close a given tdm is to the set of matrices with the "low
rank plus shift" structure. For our experimental analysis,
we used the MEDLINE dataset that contains relatively small
documents and one topic per document. We also constructed
additional, artificial datasets based on MEDLINE, so as
to test the performance of our method when applied to
multi-topic documents. The documents of these datasets
(MED_1, MED_2, ..., MED_10) consist of joined documents of the
original MEDLINE. Each document of the MED_i dataset contains i
MEDLINE documents; MED_1 is identical to MEDLINE.



Table 3 shows the value of the distance measure of Theorem 3.1
(in the Frobenius norm) for
different tdm's (pseudo-tdm and the usual VSM tdm).
According to Theorem 3.1, the smaller this value is for a given
matrix, the closer to low-rank-plus-shift structure the matrix
is. The notation we use for the naming of the datasets is of
the form NAME_i_j, where NAME is the dataset's name, i
is the number of actual semantic topics per document of the
collection and j is the number of singular triplets of the
term-by-sentence matrix that were used for the construction of
approximated document vectors.



Dataset     VSM tdm   pseudo-tdm
MED_1_1     0.6738    0.6655
MED_2_1     0.5973    0.5807
MED_3_1     0.5232    0.5030
MED_4_1     0.4601    0.4379
MED_5_1     0.4015    0.3746
MED_6_1     0.3462    0.3172
MED_7_1     0.2852    0.2534
MED_8_1     0.2282    0.1982
MED_9_1     0.1644    0.1411
MED_10_1    0.0829    0.0660

Table 3: Distance to the "low rank plus shift" structure (Frobenius
norm, cf. Theorem 3.1) for a tdm constructed using
VSM and our method (pseudo-tdm).

As shown in Table 3, the pseudo-tdm's appear
to be closer to the "low rank plus shift" structure than
the VSM tdm's. Furthermore, the distance becomes even
smaller for multi-topic collections. Figure 3 depicts singular
value distributions for classic and approximated tdm's of
datasets MED_1 and MED_2 using $k' = 1, 5$ singular triplets
to approximate the tsm's. We note that the singular values
of pseudo-tdm's are bounded by the corresponding singular
values of VSM tdm's. We next prove that this indeed holds.

3.1 Some spectral properties of pseudo-tdm's. In the
sequel, when we compare the singular values of two matrices
with the same number of rows but different number of
columns we will count the singular values according to the
number of rows. We first state three simple results.

LEMMA 3.1. Let $A \in \mathbb{R}^{m \times n}$. Let $V$ be orthonormal. Then

$$\sigma_i(A V^T) = \sigma_i(A), \quad i = 1, \ldots, m. \qquad (3.14)$$

LEMMA 3.2. Let $A = [A_1, A_2]$. Then

$$\sigma_i(A_1) \le \sigma_i(A), \quad i = 1, \ldots, m. \qquad (3.15)$$

LEMMA 3.3. Let $A, B \in \mathbb{R}^{m \times n}$ and $q = \min\{m,n\}$. Then

$$\sigma_i(A B^T) \le \sigma_i(A)\, \sigma_1(B), \quad i = 1, \ldots, q.$$


Figure 3: Plots of singular values for the classic and the
approximated tdm that arises when using one singular value of the
term-by-sentence matrix for the MEDLINE-1 dataset (first), five
singular values for the MEDLINE-1 dataset (second), one singular value
for the MEDLINE-2 dataset (third) and five singular values for the
MEDLINE-2 dataset (fourth).



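We next consider the tsm of a collection with two
documents. The following theorem comes as a generalization of
a similar result of Zha and Simon [22, Theorem 3.3].

THEOREM 3.2. Let $A \in \mathbb{R}^{m \times n}$ and write $A = [A_1, A_2]$. Then
for any $k_1, k_2$, we have

$$\sigma_i([\mathrm{best}_{k_1}(A_1), \mathrm{best}_{k_2}(A_2)]) \le \sigma_i([A_1, A_2]).$$

Proof. We recall that the SVD of a matrix $A$
can be written as

$$A = [P_k, P_k^{\perp}]\, \mathrm{diag}(\Sigma_k, \Sigma_k^{\perp})\, [Q_k, Q_k^{\perp}]^T$$

where $P_k, \Sigma_k, Q_k$ consist of the $k$ leading left and right
singular triplets and $P_k^{\perp}, \Sigma_k^{\perp}, Q_k^{\perp}$ the remaining ones. Clearly,
$P_k^T P_k^{\perp}$ and $Q_k^T Q_k^{\perp}$ are zero matrices. Then, for $i = 1, \ldots, m$,

$$\sigma_i([A_1, A_2]) = \sigma_i([[P_{k_1}, P_{k_1}^{\perp}]\, \mathrm{diag}(\Sigma_{k_1}, \Sigma_{k_1}^{\perp})\, [Q_{k_1}, Q_{k_1}^{\perp}]^T, A_2])$$
$$= \sigma_i([P_{k_1} \Sigma_{k_1}, A_2, P_{k_1}^{\perp} \Sigma_{k_1}^{\perp}]) \quad \text{(by Lemma 3.1)}$$
$$= \sigma_i([\mathrm{best}_{k_1}(A_1), [P_{k_2}, P_{k_2}^{\perp}]\, \mathrm{diag}(\Sigma_{k_2}, \Sigma_{k_2}^{\perp}), P_{k_1}^{\perp} \Sigma_{k_1}^{\perp}])$$
$$= \sigma_i([\mathrm{best}_{k_1}(A_1), P_{k_2} \Sigma_{k_2}, P_{k_2}^{\perp} \Sigma_{k_2}^{\perp}, P_{k_1}^{\perp} \Sigma_{k_1}^{\perp}])$$
$$= \sigma_i([\mathrm{best}_{k_1}(A_1), P_{k_2} \Sigma_{k_2} Q_{k_2}^T, P_{k_2}^{\perp} \Sigma_{k_2}^{\perp}, P_{k_1}^{\perp} \Sigma_{k_1}^{\perp}])$$
$$= \sigma_i([\mathrm{best}_{k_1}(A_1), \mathrm{best}_{k_2}(A_2), P_{k_2}^{\perp} \Sigma_{k_2}^{\perp}, P_{k_1}^{\perp} \Sigma_{k_1}^{\perp}]).$$

Noticing that $[\mathrm{best}_{k_1}(A_1), \mathrm{best}_{k_2}(A_2)]$ is a submatrix of
$[\mathrm{best}_{k_1}(A_1), \mathrm{best}_{k_2}(A_2), P_{k_2}^{\perp} \Sigma_{k_2}^{\perp}, P_{k_1}^{\perp} \Sigma_{k_1}^{\perp}]$ we obtain the result
by invoking Lemma 3.2.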
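The inequality of Theorem 3.2 can be checked numerically on random data. The sketch below assumes NumPy; the helper `best_rank`, the matrix sizes, the ranks and the random seed are arbitrary choices made for illustration, and a single passing example is of course not a proof.

```python
import numpy as np

def best_rank(S, k):
    """Best rank-k approximation of S from its (dense) SVD."""
    U, s, Vt = np.linalg.svd(S, full_matrices=False)
    return (U[:, :k] * s[:k]) @ Vt[:k, :]

rng = np.random.default_rng(1)
A1 = rng.random((5, 4))           # first block, 4 columns
A2 = rng.random((5, 3))           # second block, 3 columns

A = np.hstack([A1, A2])
A_hat = np.hstack([best_rank(A1, 2), best_rank(A2, 1)])

# Both matrices have 5 rows and 7 columns, so each has 5 singular values.
s_full = np.linalg.svd(A, compute_uv=False)
s_hat = np.linalg.svd(A_hat, compute_uv=False)

# Small tolerance for rounding in the (near-zero) trailing singular values.
holds = bool(np.all(s_hat <= s_full + 1e-12))
```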



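Using Theorem 3.2 we can prove the following result
for the approximate tdm constructed by our method.

THEOREM 3.3. Let $A \in \mathbb{R}^{m \times n}$ and write $A = [A_1, A_2]$ where
$A_1 \in \mathbb{R}^{m \times n_1}$ and $A_2 \in \mathbb{R}^{m \times n_2}$. Then for any $k_1, k_2$ we have

$$\sigma_i([\mathrm{best}_{k_1}(A_1) e^{(n_1)}, \mathrm{best}_{k_2}(A_2) e^{(n_2)}]) \le \sigma_i([A_1 e^{(n_1)}, A_2 e^{(n_2)}]) \qquad (3.16)$$

Proof. Let $\hat{A} = [\mathrm{best}_{k_1}(A_1), \mathrm{best}_{k_2}(A_2)]$. Then we have:

$$[A_1 e^{(n_1)}, A_2 e^{(n_2)}] = [A_1, A_2] \begin{pmatrix} e^{(n_1)} & 0 \\ 0 & e^{(n_2)} \end{pmatrix} = A B$$

and

$$[\mathrm{best}_{k_1}(A_1) e^{(n_1)}, \mathrm{best}_{k_2}(A_2) e^{(n_2)}] = \hat{A} \begin{pmatrix} e^{(n_1)} & 0 \\ 0 & e^{(n_2)} \end{pmatrix} = \hat{A} B,$$

where $B$ denotes the block matrix above, and by invoking Lemma 3.3 we have:

$$\sigma_i([\mathrm{best}_{k_1}(A_1) e^{(n_1)}, \mathrm{best}_{k_2}(A_2) e^{(n_2)}]) \le \sigma_i(\hat{A})\, \sigma_1(B).$$

As $\sigma_i(\hat{A}) \le \sigma_i(A)$ (by Theorem 3.2) the result follows.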

Theorem 3.3 is readily generalized to provide bounds for
the singular values of matrices $A = [A_1, A_2, \ldots, A_k]$ that
correspond to a collection's tem's with $k$ documents.

Note that since every term of the pseudo-tdm vector
that corresponds to the j-th document is the result of a local
operation on the document's tsm, namely $\hat{a}_j = \mathrm{best}_{k'}(S) e^{(t)}$,
the construction of each element of $\hat{a}_j$ can be interpreted
as the result of a weight factor based on local information
applied on the corresponding term of $S e^{(t)}$.



MED #   VSM      New      LSI k=20   LSI k=100   New w. LSI(20)
1       0.0313   0.0314   0.0284     0.0285      0.0279
2       0.0754   0.0728   0.0697     0.0815      0.0633
3       0.0544   0.052    0.0595     0.0503      0.0563
4       0.0793   0.0797   0.0866     0.0802      0.0919
5       0.0809   0.0795   0.0888     0.0842      0.094
6       0.088    0.0883   0.0949     0.0898      0.1027
7       0.0866   0.0904   0.0962     0.0897      0.092
8       0.0862   0.0928   0.0938     0.0867      0.1116
9       0.1031   0.1041   0.1057     0.105       0.0992
10      0.1048   0.1077   0.1015     0.1035      0.1016

Table 4: Mean precision for MEDLINE datasets. The
pseudo-tdm vectors were computed with $k' = 5$ singular
triplets.



4 Experimental results 

All experiments were conducted using MATLAB 6.5 running
on Windows XP on a 2.4 GHz Pentium IV PC with
512 MB of RAM. In all cases we compute the necessary
singular triplets by means of the MATLAB svds function,
which is based on implicitly restarted Arnoldi [15]. Our focus
was query evaluation using Equation (2.11) on the pseudo-tdm
constructed via the algorithm of Table 1. We tested the new
method using $k' = 1$ and 5 tsm singular triplets to build the
pseudo-tdm. We also used the new method in combination
with LSI, that is applying LSI on the pseudo-tdm, to get an
appreciation of the overall performance. These results were
compared with simple VSM (term frequency local
weighting) and LSI using the approximated tdm of rank k = 20, 100.
All experiments were conducted using Text to Matrix
Generator (TMG), a recent MATLAB toolbox [25]. To this effect,
we also enhanced TMG's functionality to permit the creation
of tsm's and pseudo-tdm's.

Tables 4 and 5 tabulate the mean precision of VSM, LSI
(based on 20 and 100 singular triplets) and the new method
for the different MEDLINE datasets. They also illustrate the
performance of the method when it is combined with LSI
(column "New w. LSI" of Tables 4 and 5).

MED #   VSM      New      LSI k=20   LSI k=100   New w. LSI(20)
1       0.0313   0.0325   0.0284     0.0285      0.0283
2       0.0754   0.0657   0.0697     0.0815      0.0576
3       0.0544   0.0527   0.0595     0.0503      0.0534
4       0.0793   0.077    0.0866     0.0802      0.0779
5       0.0809   0.0792   0.0888     0.0842      0.0809
6       0.088    0.0816   0.0949     0.0898      0.0903
7       0.0866   0.0903   0.0962     0.0897      0.0948
8       0.0862   0.0867   0.0938     0.0867      0.103
9       0.1031   0.1008   0.1057     0.105       0.0949
10      0.1048   0.1006   0.1015     0.1035      0.1005

Table 5: Mean precision for MEDLINE datasets. The
pseudo-tdm vectors were computed with $k' = 1$ singular
triplets.

Tables 6 and 7 present the number of queries that each method
answers with greater precision, compared to the precision
of the other methods' answers. The new method appears
to offer significant improvements over the performance of
VSM, while in many cases the new method performs better
than LSI.

MED #   VSM       New       LSI k=20   LSI k=100
1       2(7%)     9(30%)    9(30%)     7(23%)
2       3(10%)    8(27%)    9(30%)     9(30%)
3       4(13%)    3(10%)    14(47%)    8(27%)
4       4(13%)    7(23%)    15(50%)    4(13%)
5       4(13%)    6(20%)    13(43%)    5(17%)
6       3(10%)    7(23%)    11(37%)    7(23%)
7       4(13%)    9(30%)    13(43%)    2(7%)
8       3(10%)    5(17%)    14(47%)    7(23%)
9       2(67%)    7(23%)    9(30%)     10(33%)
10      6(20%)    9(30%)    13(43%)    1(3%)

Table 6: Number of queries that each method answers with
greater precision for MEDLINE datasets. The pseudo-tdm
vectors were computed with $k' = 5$ singular triplets.

MED #   VSM       New       LSI k=20   LSI k=100
1       1(3%)     15(50%)   7(23%)     6(20%)
2       4(7%)     8(27%)    9(30%)     9(30%)
3       3(10%)    7(23%)    14(47%)    6(20%)
4       4(13%)    6(20%)    14(47%)    5(17%)
5       4(13%)    6(20%)    14(47%)    5(17%)
6       4(13%)    9(30%)    8(27%)     7(23%)
7       5(17%)    8(27%)    11(37%)    4(13%)
8       3(10%)    7(23%)    13(43%)    5(17%)
9       1(3%)     8(27%)    8(27%)     11(37%)
10      7(23%)    8(27%)    12(40%)    3(10%)

Table 7: Number of queries that each method answers with
greater precision, for MEDLINE datasets. The pseudo-tdm
vectors were computed with $k' = 1$ singular triplets.

We also plot, in Figure 6, the N = 11-point interpolated
average precision for the different queries of MEDLINE
datasets. The interpolated precision is defined as:

$$\bar{p} = \frac{1}{N} \sum_{i=0}^{N-1} \hat{p}\Big(\frac{i}{N-1}\Big)$$

where:

$$\hat{p}(x) = \max\{ p_i \mid n_i \ge x r,\ i = 1:r \}$$

Figure 4 illustrates the performance of the new method
(using 5 singular triplets to approximate the tsm's) compared
to VSM. Figure 5 provides experimental results for the new
method viewed as an alternative weighting scheme. The
new method and VSM have similar performance on datasets
MED_1 to MED_5. However, for MED_6 to MED_10,
the new method improves on VSM. These results imply that
the SVD approximation of tsm's indeed captures the topic
directions of multi-topic documents and thus improves the
overall IR performance. Furthermore, LSI's performance
improves when based upon the pseudo-tdm.

Finally, in order to gain an appreciation for the method's
cost, we present in Figure the runtimes for performing a
query on tdm's that correspond to classical VSM, LSI with
values of k = 20 and 100, the new method, and finally the
combination of LSI with the new method. In all experiments,
times include the cost of performing the necessary partial
SVD's. Results indicate the new method has runtimes
similar to VSM.
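The N-point interpolated average precision used in this section can be sketched in plain Python. The sketch makes several assumptions that the text does not spell out: $n_i$ is taken to be the number of relevant documents among the top $i$ retrieved, $p_i = n_i / i$ the precision at position $i$, and $r$ the total number of relevant documents; the maximum is taken over all positions of the ranked list that is passed in. The function name `interpolated_average_precision` and the input format are hypothetical.

```python
def interpolated_average_precision(relevance, N=11):
    """relevance: 0/1 flags of the ranked list, best match first.
    Returns the N-point interpolated average precision."""
    r = sum(relevance)                  # total number of relevant documents
    if r == 0:
        return 0.0                      # convention chosen for this sketch
    n, count = [], 0
    for flag in relevance:
        count += flag
        n.append(count)                 # n_i: relevant documents in the top i
    p = [n_i / (i + 1) for i, n_i in enumerate(n)]   # p_i: precision at i

    def p_hat(level):
        # Recall level x = level / (N - 1); the test n_i >= x * r is done
        # in integers to avoid floating-point comparisons.
        candidates = [p_i for p_i, n_i in zip(p, n)
                      if n_i * (N - 1) >= level * r]
        return max(candidates) if candidates else 0.0

    return sum(p_hat(level) for level in range(N)) / N

perfect = interpolated_average_precision([1, 1, 0, 0])   # relevant docs first
delayed = interpolated_average_precision([0, 1])         # relevant doc second
```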



5 Conclusions 

Our theoretical and experimental results suggest that, at
the sentence level, matrix space models that have greater
awareness of each document's local structure and are able to
capture additional semantic information for each document
can be successfully used to improve existing IR techniques.
Our results provide significant evidence that further justifies
proposals such as those in [14], [19] towards the use of
matrix-based models and provide additional tools for IR
in such frameworks. We are currently studying the effects
of enabling additional levels of analysis (not only based
on sentences) and adding overall greater flexibility in the
algorithm, as well as the utilization of multilinear algebra
techniques (cf. [16]) and the use of parallel processing.

Figure 5: The new method as an alternative local weighting
scheme. LSI based on the pseudo-tdm has better performance than
LSI on the VSM tdm.

Acknowledgments. We thank the referees for their sugges- 
tions. 